fix: speed up video frame injection renders#596
Conversation
|
Addressed the Linux repro finding in The first commit removed the false static video/audio cost terms, but the Linux run showed another path still forced This update now closes the failed BeginFrame calibration session, switches to screenshot mode, reruns calibration there, and sizes workers from the measured screenshot samples. It only falls back to the conservative multiplier if screenshot calibration also fails. I also removed the post-fallback one-worker clamp and added a regression test that keeps baseline auto workers when screenshot-mode calibration is cheap. Verified with the focused tests/typechecks plus a built CLI smoke render of the shortened wrapper fixture; it produced a valid 1080x1920 H.264/AAC MP4 and left no Puppeteer Chrome process behind. |
jrusso1020
left a comment
There was a problem hiding this comment.
Reviewed the follow-up commit 69dfafd1 "fix: recalibrate workers after screenshot fallback". Approving — the delta cleanly fixes the Linux fallback regression flagged in the previous round.
What changed since the prior review
The first patch made BeginFrame timeout switch the job to screenshot mode but then assumed the workload was expensive (multiplier MAX_MEASURED_CAPTURE_COST_MULTIPLIER) AND clamped workers to 1. That defeated the parallel screenshot path.
This commit:
- Drops the
switchedToScreenshotAfterCalibration && workerCount > 1 → workerCount = 1clamp. - After BeginFrame timeout, closes the previous calibration session and re-runs calibration in screenshot mode against a fresh
createCaptureSession. Real measurement, not a guess. - If screenshot calibration also fails, falls back to the conservative
MAX_MEASURED_CAPTURE_COST_MULTIPLIERbudget with reasoncalibration-screenshot-failed— the right floor. - The new test "keeps baseline auto workers after screenshot fallback when measured capture is cheap" pins the exact behavior change (
forceScreenshot=true+multiplier:1calibration → 6 workers, not 1).
The other notable behavior change
estimateCaptureCostMultiplier now ignores composition.videos / composition.audios (the +0.75 per video / +0.75 per audio terms are gone). This was the right call given the rest of the PR — measured calibration is the source of truth, and the heuristic was double-counting in cases where the measurement already saw the cost. Worth knowing it affects every render path, not just the screenshot-fallback one. The PR description calls this out.
Things I checked
- All 54 CI checks green (Tests / Lint / Format / Typecheck / regression-shards / Render on windows-latest / CLI smoke / preview-regression / perf jobs).
createCompiledFrameSrcResolvertest covers both the happy path (paths undercompiledDirget URL-encoded relative URLs, including spaces →%20) and the safety path (paths outsidecompiledDirreturnnull).- The frame-source resolver wires through both calibration and the actual capture sessions consistently (
createRenderVideoFrameInjectoris called everywhere aBeforeCaptureHookis constructed in this file). No path got missed.
Minor observations, not blocking
cfg.forceScreenshot = truemutation persists across calls. Intentional (you don't want to flip back to BeginFrame after a known-bad failure), but worth a comment if it bites later.- Full fallback-and-retry flow (BeginFrame timeout → close session → retry calibration in screenshot mode → call
resolveRenderWorkerCount) is only covered end-to-end by manual smoke. The unit-level test pins the post-fallback resolver behavior, which is the part that actually regressed; the rest is mostly orchestration. Acceptable for a hotfix.
LGTM, ship it.
— hyperframes
vanceingalls
left a comment
There was a problem hiding this comment.
Staff review notes. I think the overall direction is right, but I would hold merge on the cache-enabled path because it can silently bypass the main optimization.
- P1: the served-frame fast path is skipped when extraction cache is enabled.
packages/producer/src/services/renderOrchestrator.ts:2286-2294 now extracts into compiledDir/__hyperframes_video_frames, but still passes extractCacheDir into extractAllVideoFrames. When cache is enabled, extractAllVideoFrames() writes both cache hits and misses into lookup.entry.dir (packages/engine/src/services/videoFrameExtractor.ts:724-747), not the compiled frame directory. createCompiledFrameSrcResolver() only returns URLs for paths under compiledDir (renderOrchestrator.ts:668-685), so cached frame paths fall through to the base64 fallback in videoFrameInjector.ts:44-64.
That means HYPERFRAMES_EXTRACT_CACHE_DIR / cfg.extractCacheDir disables the PR’s headline optimization. Suggested fix: either materialize/link cached frames into the compiled render frame dir for this render, or teach the file server/resolver to safely serve the cache-root frames. Please add a regression test where extractCacheDir is set and the returned frame paths still resolve to served URLs rather than base64 data URIs.
- P2: page-close timeout can kill a pooled browser while leaving stale pooled state.
packages/engine/src/services/frameCapture.ts:717-733 force-kills the browser process if page.close() or releaseBrowser() times out. With PRODUCER_ENABLE_BROWSER_POOL, releaseBrowser() only clears pooledBrowser when the refcount reaches zero (packages/engine/src/services/browserManager.ts:217-235). If one pooled session times out closing a page while another session still holds a ref, this can kill the shared browser but leave pooledBrowser pointing at the dead/disconnected instance.
Suggested fix: move forced-kill handling into browser manager, or add a force-release path that atomically clears pooled state when the process is killed. At minimum, avoid force-killing a pooled browser from an individual session unless that session owns the last ref.
- P3: encoded frame URLs may not round-trip reserved characters in video IDs.
createCompiledFrameSrcResolver() encodes each path segment with encodeURIComponent() (renderOrchestrator.ts:681-684), while the producer file server uses c.req.path directly as the filesystem-relative path (packages/producer/src/services/fileServer.ts:449-465). The current test covers spaces, but IDs/path segments containing reserved characters like #, ?, or % can be encoded into a URL that the server then looks up literally, producing a 404.
Suggested fix: explicitly decode URL path segments in the file server before joining, while preserving containment checks, and add tests for reserved characters if those IDs are supported.
CI being green is good, and the screenshot-fallback follow-up looks correct. These issues appear to be separate from the Linux worker-count regression that was already fixed in 69dfafd1.
## Summary Fixes three issues identified in the [post-merge review](#596 (review)) of PR #596: - **P1 (cache bypass):** When `extractCacheDir` is set, extracted frames live outside `compiledDir`, so `createCompiledFrameSrcResolver` rejects them and every frame falls back to base64 data URIs. Fix: symlink cached frame directories into `compiledDir/__hyperframes_video_frames/` after extraction and remap `framePaths` so the served-frame fast path works. - **P2 (pooled browser stale state):** `closeCaptureSession` force-killed the Chrome process on timeout via raw `SIGKILL` without clearing `pooledBrowser` / `pooledBrowserRefCount`, leaving other sessions with a dead browser reference. Fix: add `forceReleaseBrowser()` in `browserManager` that atomically clears pool state before killing the process. - **P3 (reserved chars in URLs):** `createCompiledFrameSrcResolver` encodes path segments with `encodeURIComponent`, but the file server used `c.req.path` (which only applies `decodeURI`) to look up files on disk. Video IDs containing `#`, `?`, or `%` produced 404s. Fix: apply `decodeURIComponent` per path segment in the file server's catch-all route. ## Test plan - [x] `createCompiledFrameSrcResolver` tests: symlinked cache paths resolve to served URLs; cache-external paths return null; reserved characters encode correctly - [x] `forceReleaseBrowser` tests: kills process + disconnects; tolerates already-killed process - [x] `createFileServer` test: `video%231/frame.jpg` serves file from `video#1/frame.jpg` on disk - [x] Typecheck: engine + producer pass - [x] Lint + format: 0 warnings, 0 errors
Problem
A real HyperFrames wrapper project with a 19-second video asset was rendering much slower than expected on local macOS. On the reproduced fixture, auto calibration became extremely expensive, the renderer reduced itself to one worker, and the render path fell far behind an equivalent Remotion implementation.
A Linux repro later showed a second path to the same bad outcome: BeginFrame calibration can time out, switch the job to screenshot mode, and still leave the worker budget clamped to one even when screenshot capture itself is healthy.
What this fixes
Root cause
The video-frame injection path forced extracted frames through disk reads and base64 data URIs during capture. On high-resolution video-wrapper compositions, that made calibration look dramatically slower than the actual visual workload, so auto worker selection collapsed from the expected parallel path to a single worker.
Separately, BeginFrame calibration failures were treated as proof that the whole capture workload was expensive. That was too conservative after switching to screenshot mode: the fallback mode may be perfectly capable of parallel capture, but the old logic assigned multiplier
8and then hard-clamped workers to1before measuring it.Verification
Local
bunx oxlint packages/engine/src/services/frameCapture.ts packages/engine/src/services/screenshotService.ts packages/engine/src/services/videoFrameInjector.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.tsbunx oxfmt --check packages/engine/src/services/frameCapture.ts packages/engine/src/services/screenshotService.ts packages/engine/src/services/videoFrameInjector.ts packages/producer/src/services/renderOrchestrator.ts packages/producer/src/services/renderOrchestrator.test.tsbunx vitest run packages/producer/src/services/renderOrchestrator.test.ts packages/engine/src/config.test.tsbun run --filter @hyperframes/engine typecheckbun run --filter @hyperframes/producer typecheckbun run --filter @hyperframes/producer buildbun run --filter @hyperframes/cli build12.85son macOS screenshot mode.Browser
agent-browserto open the full patched render output in Chrome, play it through to18.9s, and verify the browser-reported video dimensions were1080x1920.Notes
Reproduced against
iswangwenbin/hyperframes-video-wrapperat commitf04c3dd.Measured on the same machine before the Linux follow-up:
workers:5: about410s.129sinternal render time, with 5 workers selected.@remotion/cli@4.0.455,OffthreadVideo, and concurrency 7: about374swall time.Follow-up Linux repro feedback showed that the initial patch did remove false video/audio worker-cost terms, but BeginFrame timeout plus screenshot fallback still dominated there. This PR now handles that path by recalibrating in screenshot mode and only using the conservative multiplier if screenshot calibration also fails.
The source fixture still warns about sparse keyframes (
max interval: 8.33s), so re-encoding the input remains recommended for reliable seeking, but this PR removes the renderer-side bottlenecks exposed by that project.